# LatentBreak README

This repository contains code to run **LatentBreak** against language models and evaluate their completions with the **HarmBench classifier** (`cais/HarmBench-Llama-2-13b-cls`). Follow the steps below to set up your environment, generate representations, run the optimization, and evaluate the results.

---

## Prerequisites

1. **Install Python dependencies**

   ```bash
   pip install -r requirements.txt
   ```

2. **Set up environment variables**
   The following environment variables are required:

   * `HF_TOKEN`: Your Hugging Face access token.
   * `OPENAI_API_KEY`: Your OpenAI API key.

   You can export them in your shell (e.g., Bash or Zsh):

   ```bash
   export HF_TOKEN="your_huggingface_token"
   export OPENAI_API_KEY="your_openai_api_key"
   ```
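
Before launching the longer steps, it can help to confirm that both variables are visible to Python. The snippet below is a minimal sketch, not part of the repository: the helper name `missing_env_vars` is hypothetical, and it only checks that the variables are non-empty, not that the tokens are valid.

```python
import os

# Variable names taken from the list above.
REQUIRED_VARS = ("HF_TOKEN", "OPENAI_API_KEY")


def missing_env_vars(environ=None):
    """Return the required variable names that are unset or empty.

    `environ` defaults to the current process environment; passing a plain
    dict makes the check easy to try out without touching the real shell.
    """
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Example: missing_env_vars() returns an empty list when both are exported.
```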

---

## Step 1: Create the Representation Dataset

The representation dataset is used to compute the target centroid needed for the LatentBreak optimization process. This step will:

* Download and preprocess the Alpaca dataset.
* Build the training split.
* Convert each prompt into hidden-state representations for selected models.

### Usage

Run the provided script with your desired arguments:

```bash
python create_representation_dataset.py \
  --model-names llama2-7b gemma-7b vicuna-13b \
  --device cuda:0 \
  --batch-size 128 \
  --save-dir ./dataset/representations
```

#### Arguments

* `--model-names`, `-m`
  List of model identifiers to process (e.g., `llama2-7b`, `gemma-7b`, `vicuna-13b`, `mistral7b`, `phi-mini`, `llama3-8b`, `qwen7b`, `r2d2`, `mistral7b-rr`, `llama3-8b-rr`).
* `--device`, `-d`
  Target device for inference (e.g., `cuda:0` or `cpu`).
* `--batch-size`, `-b`
  Number of samples to process per batch (default: `128`).
* `--save-dir`, `-s`
  Directory in which to save the hidden-state representations (default: `./dataset/representations`).

### What Happens
After successful execution, you will find files like:

```bash
./dataset/representations/llama2-7b/HLx_train.pt
./dataset/representations/gemma-7b/HLx_train.pt
...etc.
```
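
To confirm that Step 1 produced a file for every model before moving on, a small check like the one below may be useful. It is a sketch under one assumption: that the layout is always `<save-dir>/<model-name>/HLx_train.pt`, as the example paths above suggest. The function names are hypothetical and not part of the repository.

```python
from pathlib import Path

DEFAULT_SAVE_DIR = "./dataset/representations"


def expected_representation_files(model_names, save_dir=DEFAULT_SAVE_DIR):
    """Map each model name to the path where its representations are expected.

    Assumes the <save-dir>/<model-name>/HLx_train.pt layout shown above.
    """
    root = Path(save_dir)
    return {name: root / name / "HLx_train.pt" for name in model_names}


def missing_representations(model_names, save_dir=DEFAULT_SAVE_DIR):
    """Return the model names whose representation file does not exist yet."""
    paths = expected_representation_files(model_names, save_dir)
    return [name for name, path in paths.items() if not path.is_file()]


# Example: missing_representations(["llama2-7b", "gemma-7b"]) lists the
# models that still need to be processed by create_representation_dataset.py.
```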

---

## Step 2: Run the Optimization and Evaluation

Use `main.py` to launch the LatentBreak optimization, generate adversarial prompts, obtain model responses, and evaluate them with the HarmBench classifier.

### Usage

```bash
python main.py \
  --model llama2-7b \
  --main_device cuda:0 \
  --evaluator_device cuda:1 \
  --substitutor gpt \
  --subs 20 \
  --layer 31 \
  --test_file ./dataset/raw/advbench_pair.csv \
  --num_samples 100 \
  --judge True \
  --evaluator True
```

#### Key Arguments

* `--model`
  Which language model to attack (default: `llama2-7b`).
* `--main_device`
  Device for running the primary model (e.g., `cuda:0`).
* `--evaluator_device`
  Device for the HarmBench classifier (e.g., `cuda:1`).
* `--substitutor`
  Substitution model (`gpt` or `mbert`).
* `--subs`
  Number of candidate substitutions per word (default: `20`).
* `--layer`
  Decoder layer on which representations are computed (e.g., `31`).
* `--test_file`
  Path to the CSV file containing the harmful requests.
* `--num_samples`
  How many prompts to process (default: `50`).
* `--judge`
  If `True`, checks during optimization that the modified prompt keeps the original intent (default: `True`).
* `--evaluator`
  If `True`, runs the HarmBench classifier in parallel to judge each generated completion. If `False`, only returns optimized prompts; completion generation and evaluation can be performed later in **Step 3**.

### Outputs

* A JSON results file named according to the experiment parameters, containing entries for each prompt:

  ```json
  {
    "0": {
      "optimization": ["optimized prompt text"],
      "response": "model's completion",
      "is_jailbreak_harmbench": true
    },
    "1": { ... }
  }
  ```

* When `--evaluator=True`, each entry includes:

  1. **Optimized prompts at each iteration**
  2. **Model response at the last iteration**
  3. **HarmBench judgment**

* When `--evaluator=False`, the HarmBench step is deferred to Step 3.
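
A quick way to see which entries of a results file still need the deferred evaluation is to count how many carry a judgment. The sketch below assumes the JSON has the shape shown above (a dict keyed by sample index) and that entries produced without the evaluator simply lack the `is_jailbreak_harmbench` key; that second point is an assumption, not something the repository confirms. The function names are hypothetical.

```python
import json


def load_results(path):
    """Load a results JSON file produced by main.py."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def summarize_results(results):
    """Count samples, judged entries, and entries still awaiting evaluation.

    Assumes `results` is a dict keyed by sample index, and that an entry
    without "is_jailbreak_harmbench" has not been evaluated yet.
    """
    summary = {"samples": len(results), "judged": 0, "pending_evaluation": 0}
    for entry in results.values():
        if "is_jailbreak_harmbench" in entry:
            summary["judged"] += 1
        else:
            summary["pending_evaluation"] += 1
    return summary
```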

---

## Step 3 (Optional): Deferred HarmBench Evaluation

If you ran `main.py` with `--evaluator=False`, you can generate the completions, run the HarmBench classification, and perform the final selection afterwards with the separate `generate_completions_eval_HB.py` script.

### Usage

```bash
CUDA_VISIBLE_DEVICES=0 python generate_completions_eval_HB.py \
  --optimization_path ./Results/model/filename.json \
  --csv_path ./dataset/raw/advbench_pair.csv \
  --evaluation_path ./final_evaluation.json
```

#### Arguments

* `--optimization_path`
  The JSON file produced by `main.py` when `--evaluator=False`.
* `--csv_path`
  The original CSV of prompts with their prompt categories.
* `--evaluation_path`
  Where to save the evaluated results (default: `./evaluation_results.json`).

### Outputs

* A JSON file listing, for each sample:

  * `"prompt"`: the original prompt.
  * `"final_adv"`: the selected adversarial prompt.
  * `"response"`: the model’s generated completion.
  * `"is_jailbreak_harmbench"`: `1` if HarmBench flagged it as a jailbreak, else `0`.
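
The attack success rate (ASR) can then be read off the evaluation file as the fraction of samples with `is_jailbreak_harmbench` equal to `1`. The following is a minimal sketch, not the repository's own reporting code: the function name is hypothetical, and because the exact container type of the file is not specified here, it accepts either a list of records or a dict keyed by sample index.

```python
import json


def attack_success_rate(samples):
    """Return the fraction of samples flagged as jailbreaks.

    `samples` may be a list of records or a dict keyed by sample index
    (assumption: the evaluation file uses one of these two shapes).
    """
    records = list(samples.values()) if isinstance(samples, dict) else list(samples)
    if not records:
        raise ValueError("no samples to evaluate")
    flagged = sum(1 for r in records if int(r["is_jailbreak_harmbench"]) == 1)
    return flagged / len(records)


def asr_from_file(path):
    """Load an evaluation JSON file and compute its ASR."""
    with open(path, encoding="utf-8") as f:
        return attack_success_rate(json.load(f))
```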

---

With Step 3 complete, you have LatentBreak's final attack success rate (ASR) against the target model.
